Text Mining and Sentiment Analysis in R

Instructor: Aleszu Bajak

Introduction

This O’Reilly course will introduce participants to the techniques and applications of text mining and sentiment analysis by training them in easy-to-use open-source tools and scalable, replicable methodologies that will make them stronger data scientists and more thoughtful communicators.

Using RStudio and several engaging and topical datasets sourced from politics, social science, and social media, this course will introduce techniques for collecting, wrangling, mining and analyzing text data.

The course will also have participants derive and communicate insights based on their textual analysis using a set of data visualization methods. The techniques that will be used include n-gram analysis, sentiment analysis, and parts-of-speech analysis.

By the end of this live, hands-on, online course, you’ll understand:

And you’ll be able to:

Requirements and accessing data

Ideally, participants will have the latest versions of R and RStudio and the tidytext and tidyverse packages. To access all R scripts, participants should download this GitHub repository and set it as their working directory in RStudio.

This course can also be accessed on RStudio Cloud here, though a free account is required.

Table of contents

  1. Course outline
  2. Datasets used
  3. Text analysis in the wild
  4. Text analysis methods
  5. Sentiment analysis methods
  6. Visualization and communication
  7. Final activity

Course outline

Datasets used

Text as data

Text mining is all about making sense of text. That could mean counting the frequency of specific words, understanding the overall sentiment of a document, or applying statistical techniques to draw big-picture conclusions from a corpus. Whether one is analyzing social media posts, customer reviews or news articles, these techniques can be essential to understanding and deriving meaningful insights.
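
As a toy illustration, even base R can count word frequencies in a sentence; the course uses tidytext for this instead:

```r
# Toy illustration of "text as data": counting word frequencies with base R
sentence <- "The quick brown fox jumps over the lazy dog"
words <- strsplit(tolower(sentence), " ")[[1]] # lowercase, then split on spaces
sort(table(words), decreasing = TRUE)          # "the" appears twice, the rest once
```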

Note: Though there are several ways to mine data and perform sentiment analysis in R – with packages such as tm, quanteda, udpipe, and sentimentr – this course uses R’s tidytext package, developed by Julia Silge and David Robinson, and several tidy tools found in the tidyverse package.

Text analysis in the wild

BuzzFeed’s analysis of U.S. State of the Union speeches over time is a great example of text analysis. As an added bonus, journalist Peter Aldhous shared all his data and open-sourced his methodology as an Rmarkdown document. Related: New York Times science graphics editor Jonathan Corum also has a cool State of the Union visualization tool on his website.

The New York Times’ Mueller Report citations article is another example of a text analysis in mainstream media, used to explain which of Trump’s associates appeared in the report and how often. Check out my Storybench tutorial that includes R code for mining the Mueller Report for specific keywords.

FiveThirtyEight published an analysis tallying the instances of the name “Trump” in 2020 candidate messaging. The dataset was 2020 candidate emails sent to subscribers.

The Boston Globe’s Arresting Words investigation visualized transcripts of police arrests to isolate the top words uttered by those being hauled in.

Text analysis in marketing

Crimson Hexagon, recently acquired by Brandwatch, delivers “actionable social insights for the enterprise,” e.g. how is Under Armour clothing or 5-hour Energy being discussed online?

Sentiment analysis in the wild

FiveThirtyEight applied sentiment analysis to Reddit comments to assess the overall “sadness” of baseball fans summarized by team.

In Roll Call, I published a sentiment analysis of tweets by politicians in the run-up to the 2018 Midterms. We’ll get into this data and analysis later in this course.

FiveThirtyEight used sentiment analysis to help contrast presidential inauguration speeches. Do the “More positive words” annotation and the x-axis ticks make this graphic easier to understand?

Sentiment analysis in finance

Bloomberg routinely analyzes Twitter sentiment surrounding keywords, companies and entities, such as this 2017 Vodafone analysis, to better inform the trading strategies of its clients.

J.P. Morgan has published about sentiment analysis it has applied to analyst reports and news articles to assess the relationship between stock trades and news sentiment.

Text mining for fun

The community of tidytext users is large and very open to sharing code. Here are some examples of informal text mining and sentiment analyses that have been popular on the Internet: craft beer reviews and Harry Potter books.

Discussion

Question 1: What real-world text analysis projects stuck out to you as memorable? Why? What was harder to get your head around and why?

Question 2: Choose one of the projects presented and write out some potential caveats, assumptions and/or problems faced with data collection, analysis or communication. Share via the group chat.

Text analysis methods

This section will introduce methods for tokenization, n-gram analysis and parts-of-speech analysis. We will then conduct a brief text analysis activity to isolate top words, top phrases and top parts of speech for a dataset.

Tokenization

First let’s do some basic text ingestion and analysis using tidytext functions like unnest_tokens() for tokenizing and count() for, well, counting.

# Load packages
#install.packages("tidyverse")
#install.packages("tidytext")
library(tidyverse)
## Registered S3 methods overwritten by 'ggplot2':
##   method         from 
##   [.quosures     rlang
##   c.quosures     rlang
##   print.quosures rlang
## ── Attaching packages ─────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.1     ✔ purrr   0.3.2
## ✔ tibble  2.1.1     ✔ dplyr   0.8.1
## ✔ tidyr   0.8.3     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(tidytext)

# Text and tokenization
line <- c("The quick brown fox jumps over the lazy dog.") 
line 
## [1] "The quick brown fox jumps over the lazy dog."
line_tbl <- as_tibble(line) # get into Tidy format
## Warning: Calling `as_tibble()` on a vector is discouraged, because the behavior is likely to change in the future. Use `tibble::enframe(name = NULL)` instead.
## This warning is displayed once per session.
line_tbl
## # A tibble: 1 x 1
##   value                                       
##   <chr>                                       
## 1 The quick brown fox jumps over the lazy dog.
line_tokenized <- line_tbl %>%
  unnest_tokens(word, value)  # tokenize!
line_tokenized
## # A tibble: 9 x 1
##   word 
##   <chr>
## 1 the  
## 2 quick
## 3 brown
## 4 fox  
## 5 jumps
## 6 over 
## 7 the  
## 8 lazy 
## 9 dog
line_tokenized %>% count(word) # count
## # A tibble: 8 x 2
##   word      n
##   <chr> <int>
## 1 brown     1
## 2 dog       1
## 3 fox       1
## 4 jumps     1
## 5 lazy      1
## 6 over      1
## 7 quick     1
## 8 the       2
line_tokenized %>% count(word, sort=TRUE) # count and sort
## # A tibble: 8 x 2
##   word      n
##   <chr> <int>
## 1 the       2
## 2 brown     1
## 3 dog       1
## 4 fox       1
## 5 jumps     1
## 6 lazy      1
## 7 over      1
## 8 quick     1

Let’s also remove the stop words using the anti_join() function, which is well explained here. As the slide shows, anti_join() returns only the rows that have no match on a specified column in a supplied table. We’ll use left_join() later in this course. SQL and Python users will recognize these functions as joins or merges.
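
As a minimal illustration of that behavior with two toy tables (invented here, not course data):

```r
library(dplyr)

# anti_join() keeps only the rows of x with NO match in y
words <- tibble(word = c("the", "quick", "fox"))
stops <- tibble(word = c("the", "a", "an"))
anti_join(words, stops, by = "word")
# "the" is dropped; "quick" and "fox" remain
```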

We can also inspect the stop_words table itself with glimpse().

line_clean <- line_tokenized %>% 
  anti_join(stop_words) # cut out stopwords 
## Joining, by = "word"
line_clean 
## # A tibble: 6 x 1
##   word 
##   <chr>
## 1 quick
## 2 brown
## 3 fox  
## 4 jumps
## 5 lazy 
## 6 dog
glimpse(stop_words) # let's inspect stopwords. 
## Observations: 1,149
## Variables: 2
## $ word    <chr> "a", "a's", "able", "about", "above", "according", "acco…
## $ lexicon <chr> "SMART", "SMART", "SMART", "SMART", "SMART", "SMART", "S…

Question: Are all of these stop words really worth cutting out? Can you find one that you want to include in your analysis?
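
If you decide a stop word is worth keeping, one approach is to filter it out of the stop_words table before the anti_join(). A sketch, assuming you want to keep “over”:

```r
library(dplyr)
library(tidytext)

# Drop "over" from the stop word list so it survives the anti_join() cleanup
my_stop_words <- stop_words %>%
  filter(word != "over")
```

You would then pass my_stop_words, rather than stop_words, to anti_join().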

n-gram analysis

Google’s n-gram viewer is probably the most well-known example of n-gram analysis.

We’ll use President Donald Trump’s 2018 State of the Union speech to explore 1-grams, 2-grams and 3-grams.

First, we’ll ingest the text file, tidy the data and then tokenize the text.

trump_speech <- read_file("trump2018.txt") # alternatively, use read_file(file.choose())

trump_speech_tbl <- as_tibble(trump_speech) # tidy the data
trump_speech_tbl
## # A tibble: 1 x 1
##   value                                                                    
##   <chr>                                                                    
## 1 "Mr. Speaker, Mr. Vice President, Members of Congress, the First Lady of…
trump_counts <- trump_speech_tbl %>%
  unnest_tokens(word, value) # tokenize
trump_counts # 5,864 words
## # A tibble: 5,864 x 1
##    word     
##    <chr>    
##  1 mr       
##  2 speaker  
##  3 mr       
##  4 vice     
##  5 president
##  6 members  
##  7 of       
##  8 congress 
##  9 the      
## 10 first    
## # … with 5,854 more rows

We’ll next use the count() function we previously introduced. Wow, that’s a lot of “and”s, “the”s and “to”s. Let’s remove the stop words and count the words again. You can change the “n =” argument in head() to customize the output. Note: I like to save my progress by writing my tables out as CSVs. You can do this with write.csv(trump_counts, "trump_counts.csv").

trump_counts <- trump_speech_tbl %>% 
  unnest_tokens(word, value) %>%
  count(word, sort = TRUE) %>% # count words
  glimpse()
## Observations: 1,674
## Variables: 2
## $ word <chr> "and", "the", "to", "of", "we", "in", "our", "a", "that", "…
## $ n    <int> 247, 237, 197, 138, 131, 105, 104, 98, 68, 57, 55, 48, 47, …
trump_counts <- trump_speech_tbl %>%
  unnest_tokens(word, value) %>%
  anti_join(stop_words) %>% # remove stopwords
  count(word, sort=TRUE) %>%
  glimpse()
## Joining, by = "word"
## Observations: 1,339
## Variables: 2
## $ word <chr> "people", "american", "americans", "tonight", "america", "c…
## $ n    <int> 33, 32, 24, 23, 22, 18, 15, 13, 13, 12, 11, 10, 9, 9, 9, 8,…
head(trump_counts)
## # A tibble: 6 x 2
##   word          n
##   <chr>     <int>
## 1 people       33
## 2 american     32
## 3 americans    24
## 4 tonight      23
## 5 america      22
## 6 country      18
head(trump_counts, n=15)
## # A tibble: 15 x 2
##    word               n
##    <chr>          <int>
##  1 people            33
##  2 american          32
##  3 americans         24
##  4 tonight           23
##  5 america           22
##  6 country           18
##  7 tax               15
##  8 congress          13
##  9 time              13
## 10 home              12
## 11 world             11
## 12 family            10
## 13 administration     9
## 14 nation             9
## 15 united             9

Ok, moving on to bigrams. Let’s use unnest_tokens()’s “token” and “n” arguments. Then we’ll count the bigrams and view them with head().

# Bigrams
bigrams <- trump_speech_tbl %>%
  unnest_tokens(bigram, value, token = "ngrams", n = 2) %>%
  count(bigram, sort=TRUE)
head(bigrams, n=20) # too many stopwords!
## # A tibble: 20 x 2
##    bigram           n
##    <chr>        <int>
##  1 in the          27
##  2 we are          24
##  3 of the          19
##  4 thank you       19
##  5 we have         19
##  6 and the         11
##  7 and we          11
##  8 on the          11
##  9 for the         10
## 10 of our          10
## 11 our country     10
## 12 congress to      9
## 13 the people       9
## 14 the united       9
## 15 the world        9
## 16 to be            9
## 17 to the           9
## 18 we will          9
## 19 i am             8
## 20 the american     8

This works but it’s not terribly insightful because of all the stop words. Let’s remove stop words following the approach Julia Silge and David Robinson suggest in Text Mining with R.

# Better bigrams
trump_bigrams <- trump_speech_tbl %>%
  unnest_tokens(bigram, value, token = "ngrams", n = 2) 

bigrams_separated <- trump_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ") # separate bigram by space

bigrams_filtered <- bigrams_separated %>% 
  filter(!word1 %in% stop_words$word) %>% # filter out stopwords from word1 column
  filter(!word2 %in% stop_words$word) # filter out stopwords from word2 column

bigram_counts <- bigrams_filtered %>% 
  count(word1, word2, sort = TRUE) # count new bigrams

bigram_counts 
## # A tibble: 722 x 3
##    word1       word2      n
##    <chr>       <chr>  <int>
##  1 american    people     5
##  2 ms          13         5
##  3 north       korea      4
##  4 seong       ho         4
##  5 tax         cuts       4
##  6 immigration system     3
##  7 kenton      stacy      3
##  8 tax         cut        3
##  9 13          gang       2
## 10 american    dream      2
## # … with 712 more rows

Now we see some interesting bigrams like “MS 13,” “North Korea” and “immigration system.”

Question: How would you export this table as a CSV? Can you write the function in R?

Finally, let’s change the “n” argument to “3” and count the trigrams.

# Trigrams
trump_trigrams <- trump_speech_tbl %>%
  unnest_tokens(trigram, value, token = "ngrams", n = 3) 

trigrams_separated <- trump_trigrams %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ") # separate trigram by space

trigrams_filtered <- trigrams_separated %>% 
  filter(!word1 %in% stop_words$word) %>% # filter out stopwords from word1 column
  filter(!word2 %in% stop_words$word) %>% # filter out stopwords from word2 column
  filter(!word3 %in% stop_words$word) # filter out stopwords from word3 column

trigram_counts <- trigrams_filtered %>% 
  count(word1, word2, word3, sort = TRUE) # count new trigrams

trigram_counts 
## # A tibble: 235 x 4
##    word1 word2    word3         n
##    <chr> <chr>    <chr>     <int>
##  1 ms    13       gang          2
##  2 1,500 va       employees     1
##  3 1.8   million  illegal       1
##  4 11    months   ago           1
##  5 13    horrible people        1
##  6 1996  seong    ho            1
##  7 20    straight minutes       1
##  8 2016  american taxpayers     1
##  9 220   ms       13            1
## 10 3     million  workers       1
## # … with 225 more rows

Question 1: How would you summarize these results for a non-technical audience? Could you design a top 10 table with your exported CSV and embed it alongside your code?

Question 2: What would your headline be for this 2018 State of the Union speech, based on these n-grams, bigrams and/or trigrams?

Doing string calculations

The stringr package brings together loads of useful tools for string manipulation and calculation. Below, run through the code to see how str_length() can be used to calculate the length of strings. Note: str_length() also counts spaces and punctuation.

# Calculate length of strings 
line <- c("The quick brown fox jumps over the lazy dog.") 
line 
## [1] "The quick brown fox jumps over the lazy dog."
library(stringr)
str_length(line) 
## [1] 44
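
A quick check of that note: spaces and punctuation each count as one character.

```r
library(stringr)

# str_length() counts every character, including spaces and punctuation
str_length("dog")    # 3
str_length("dog.")   # 4: the period counts
str_length("a dog.") # 6: the space counts too
```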

Next, we’ll pull in the State of the Union speeches CSV file and create a new object with a new column “length.” In this new column, we’ll store the length of each speech we calculate with str_length(). Using the ggplot2 package we can plot a scatterplot of date vs. length to visualize the length of the speeches over time. We’ll return to more data visualization with ggplot2 soon.

sou <- read_csv("sou.csv")
## Parsed with column specification:
## cols(
##   link = col_character(),
##   president = col_character(),
##   message = col_character(),
##   date = col_date(format = ""),
##   text = col_character()
## )
## Warning: 20 parsing failures.
## row  col           expected actual      file
## 232 text delimiter or quote      I 'sou.csv'
## 232 text delimiter or quote        'sou.csv'
## 232 text delimiter or quote      D 'sou.csv'
## 232 text delimiter or quote        'sou.csv'
## 232 text delimiter or quote      C 'sou.csv'
## ... .... .................. ...... .........
## See problems(...) for more details.
glimpse(sou)
## Observations: 232
## Variables: 5
## $ link      <chr> "http://www.presidency.ucsb.edu/ws/index.php?pid=29431…
## $ president <chr> "George Washington", "George Washington", "George Wash…
## $ message   <chr> "First Annual Address to Congress", "Second Annual Add…
## $ date      <date> 1790-01-08, 1790-12-08, 1791-10-25, 1792-11-06, 1793-…
## $ text      <chr> "Fellow-Citizens of the Senate and House of Representa…
length_of_sous <- sou %>%
  mutate(length = str_length(text))
glimpse(length_of_sous)
## Observations: 232
## Variables: 6
## $ link      <chr> "http://www.presidency.ucsb.edu/ws/index.php?pid=29431…
## $ president <chr> "George Washington", "George Washington", "George Wash…
## $ message   <chr> "First Annual Address to Congress", "Second Annual Add…
## $ date      <date> 1790-01-08, 1790-12-08, 1791-10-25, 1792-11-06, 1793-…
## $ text      <chr> "Fellow-Citizens of the Senate and House of Representa…
## $ length    <int> 6678, 8378, 14108, 12683, 11624, 17561, 12245, 17289, …
# Plot it 
ggplot(length_of_sous, aes(date, length)) +
  geom_point()

We can also create something akin to the Google n-gram viewer by “searching” through the speeches using str_count() and calculating the number of times a specific keyword or phrase appears. We can then plot the output as a line chart.

# Search for a string with str_count()
speeches_w_keyword <- sou %>%
  group_by(text, date, president, message) %>%
  mutate(count = str_count(text, "health care")) # try "people" or "crime" 
speeches_w_keyword
## # A tibble: 232 x 6
## # Groups:   text, date, president, message [232]
##    link           president   message     date       text             count
##    <chr>          <chr>       <chr>       <date>     <chr>            <int>
##  1 http://www.pr… George Was… First Annu… 1790-01-08 Fellow-Citizens…     0
##  2 http://www.pr… George Was… Second Ann… 1790-12-08 Fellow-Citizens…     0
##  3 http://www.pr… George Was… Third Annu… 1791-10-25 "Fellow-Citizen…     0
##  4 http://www.pr… George Was… Fourth Ann… 1792-11-06 Fellow-Citizens…     0
##  5 http://www.pr… George Was… Fifth Annu… 1793-12-03 "Fellow-Citizen…     0
##  6 http://www.pr… George Was… Sixth Annu… 1794-11-19 "Fellow-Citizen…     0
##  7 http://www.pr… George Was… Seventh An… 1795-12-08 Fellow-Citizens…     0
##  8 http://www.pr… George Was… Eighth Ann… 1796-12-07 Fellow-Citizens…     0
##  9 http://www.pr… John Adams  First Annu… 1797-11-22 Gentlemen of th…     0
## 10 http://www.pr… John Adams  Second Ann… 1798-12-08 Gentlemen of th…     0
## # … with 222 more rows
# Plot it
ggplot(speeches_w_keyword, aes(date,count)) +
  geom_line(stat="identity")

Question: Could you imagine an interactive app that allows users to search a dataset for specific keywords? What dataset would you use? What would the user interface look like? Shiny apps in R can be built to do this. See this prototype of mine and this documentation on Shiny apps from RStudio.

Parts-of-speech analysis

Adjectives can drive the tone of a sentence. Or a tweet. That’s where parts-of-speech analysis comes in.

Let’s look at Donald Trump’s adjective use on Twitter to illustrate the mechanics of parts-of-speech analysis. We’ll also try to visualize the results in a couple different ways. These tweets were collected with R’s rtweet package, and some great tutorials can be found here.

First, we’ll pull in the dataset of tweets collected for the month of July, 2019. Then we’ll tokenize and remove stop words. The very next operation uses inner_join() to merge in the “parts_of_speech” dataset, which comes with tidytext and includes more than 208,000 word/pos combinations.

What remains are the words from Trump’s tweets that can be tagged with a specific part of speech. If we isolate adjectives, we can glimpse() the results and plot them as a bar chart.
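
inner_join() is the mirror image of the anti_join() we used earlier: it keeps only the rows that do match. A toy sketch with invented words (not the course data):

```r
library(dplyr)

# inner_join() keeps only rows with a match in both tables
words <- tibble(word = c("great", "meeting", "zzzz"))
pos_dict <- tibble(word = c("great", "meeting"),
                   pos  = c("Adjective", "Noun"))
inner_join(words, pos_dict, by = "word")
# "zzzz" drops out because it has no part-of-speech tag
```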

Trump <- read_csv("Trump_tweets.csv")
## Parsed with column specification:
## cols(
##   created_at = col_datetime(format = ""),
##   screen_name = col_character(),
##   text = col_character(),
##   source = col_character(),
##   retweet_count = col_double(),
##   favorite_count = col_double()
## )
glimpse(Trump)
## Observations: 399
## Variables: 6
## $ created_at     <dttm> 2019-07-29 17:29:30, 2019-07-29 15:55:39, 2019-0…
## $ screen_name    <chr> "realDonaldTrump", "realDonaldTrump", "realDonald…
## $ text           <chr> "Looking forward to my meeting at 2:00 P.M. with …
## $ source         <chr> "Twitter for iPhone", "Twitter for iPhone", "Twit…
## $ retweet_count  <dbl> 6027, 6592, 3985, 4511, 9847, 11026, 16870, 21652…
## $ favorite_count <dbl> 27028, 0, 0, 0, 44084, 47444, 81508, 92406, 45148…
Trump_tokenized_pos <- Trump %>%
  unnest_tokens(word, text) %>% # tokenize the tweets
  anti_join(stop_words) %>%
  inner_join(parts_of_speech) # join parts of speech dictionary
## Joining, by = "word"
## Joining, by = "word"
parts_of_speech
## # A tibble: 208,259 x 2
##    word    pos      
##    <chr>   <chr>    
##  1 3-d     Adjective
##  2 3-d     Noun     
##  3 4-f     Noun     
##  4 4-h'er  Noun     
##  5 4-h     Adjective
##  6 a'      Adjective
##  7 a-1     Noun     
##  8 a-axis  Noun     
##  9 a-bomb  Noun     
## 10 a-frame Noun     
## # … with 208,249 more rows
glimpse(Trump_tokenized_pos)
## Observations: 6,626
## Variables: 7
## $ created_at     <dttm> 2019-07-29 17:29:30, 2019-07-29 17:29:30, 2019-0…
## $ screen_name    <chr> "realDonaldTrump", "realDonaldTrump", "realDonald…
## $ source         <chr> "Twitter for iPhone", "Twitter for iPhone", "Twit…
## $ retweet_count  <dbl> 6027, 6027, 6027, 6027, 6027, 6027, 6027, 6592, 6…
## $ favorite_count <dbl> 27028, 27028, 27028, 27028, 27028, 27028, 27028, …
## $ word           <chr> "forward", "forward", "forward", "forward", "meet…
## $ pos            <chr> "Adjective", "Noun", "Adverb", "Verb (transitive)…
Trump_adj <- Trump_tokenized_pos %>%
  group_by(word) %>% 
  filter(pos == "Adjective") %>%  # filter for adjectives
  count(word, sort = TRUE) %>% 
  glimpse()
## Observations: 360
## Variables: 2
## Groups: word [360]
## $ word <chr> "american", "left", "bad", "fake", "radical", "united", "mi…
## $ n    <int> 26, 21, 16, 16, 15, 15, 11, 10, 10, 10, 9, 9, 9, 9, 8, 8, 8…
head(Trump_adj, n=10)
## # A tibble: 10 x 2
## # Groups:   word [10]
##    word           n
##    <chr>      <int>
##  1 american      26
##  2 left          21
##  3 bad           16
##  4 fake          16
##  5 radical       15
##  6 united        15
##  7 military      11
##  8 anti          10
##  9 deal          10
## 10 republican    10
ggplot(head(Trump_adj, n=10), aes(reorder(word, n), n)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  xlab("")+ 
  coord_flip()

Q&A

Question: Based on the n-gram analysis, string-based calculations and parts-of-speech analysis we’ve performed, what other questions could you ask of the State of the Union speeches or a politician’s tweets? What other dataset might you want to compile for one of the methodologies practiced above?

Sentiment analysis methods

Sentiment analysis is widely applied to understand politics, finance and sports, as we saw in our examples analyzing the sentiment of social media posts from candidates running for Senate, news articles about a particular company or product, and Reddit comments from baseball fans.

While sentiment analysis can involve complex natural language processing models like word2vec, for the purposes of this course we’ll explore its simplest form: scoring individual words based on a dictionary of word/score pairs. Let’s look at some sentiment dictionaries.
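
Before loading the real lexicons, here’s a minimal sketch of the idea with a hand-made two-word dictionary (toy data invented for illustration, not one of the course lexicons):

```r
library(dplyr)
library(tidytext)

# Toy word/score dictionary
toy_dict <- tibble(word = c("good", "bad"), score = c(3, -3))

toy_result <- tibble(value = "a good day, not a bad day") %>%
  unnest_tokens(word, value) %>%        # tokenize the sentence
  inner_join(toy_dict, by = "word") %>% # keep only the scored words
  summarise(avgscore = mean(score))

toy_result
# "good" scores 3 and "bad" scores -3, so the average is 0
```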

# Sentiment dictionaries
afinn <- get_sentiments("afinn")
afinn 
## # A tibble: 2,476 x 2
##    word       score
##    <chr>      <int>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # … with 2,466 more rows
bing <- get_sentiments("bing")
bing
## # A tibble: 6,788 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faced     negative 
##  2 2-faces     negative 
##  3 a+          positive 
##  4 abnormal    negative 
##  5 abolish     negative 
##  6 abominable  negative 
##  7 abominably  negative 
##  8 abominate   negative 
##  9 abomination negative 
## 10 abort       negative 
## # … with 6,778 more rows
nrc <- get_sentiments("nrc")
nrc
## # A tibble: 13,901 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # … with 13,891 more rows
labMT<- read.csv("labMT.csv")
head(labMT)
##        word score
## 1  laughter  8.50
## 2 happiness  8.44
## 3      love  8.42
## 4     happy  8.30
## 5   laughed  8.26
## 6     laugh  8.22
tail(labMT)
##            word score
## 10217     death  1.54
## 10218    murder  1.48
## 10219 terrorism  1.48
## 10220      rape  1.44
## 10221   suicide  1.30
## 10222 terrorist  1.30

Let’s score the sentiment of a very simple sentence using the afinn dictionary. We’ll ingest the sentence, tidy it, tokenize it and then use inner_join() to merge in the afinn word/score dictionary and leave only those words that have been scored. Finally, we’ll calculate the average score of the five words that were scored. (Notice that we started out with 10 words.)

alexander <- c("Alexander and the terrible, horrible, no good, very bad day")
alexander <- as_tibble(alexander)
alexander_scored <- alexander %>%
  unnest_tokens(word, value) %>%
  inner_join(afinn, by="word") %>% 
  glimpse()
## Observations: 5
## Variables: 2
## $ word  <chr> "terrible", "horrible", "no", "good", "bad"
## $ score <int> -3, -3, -1, 3, -3
mean(alexander_scored$score)
## [1] -1.4

Ok, let’s go to a bigger dataset and do this at scale: the State of the Union speeches. Let’s merge in the afinn dictionary and then create a pivot table that summarizes the average sentiment score by president using the mean() function embedded in the summarise() function. Finally, we’ll plot the results as a bar chart.

sentiment_sou <- sou %>%
  unnest_tokens(word, text) %>%
  inner_join(afinn, by = "word") # we'll join in the AFINN dictionary
glimpse(sentiment_sou)
## Observations: 115,335
## Variables: 6
## $ link      <chr> "http://www.presidency.ucsb.edu/ws/index.php?pid=29431…
## $ president <chr> "George Washington", "George Washington", "George Wash…
## $ message   <chr> "First Annual Address to Congress", "First Annual Addr…
## $ date      <date> 1790-01-08, 1790-01-08, 1790-01-08, 1790-01-08, 1790-…
## $ word      <chr> "embrace", "great", "opportunity", "prospects", "impor…
## $ score     <int> 1, 3, 2, 1, 2, 1, 3, 2, 3, 2, 2, 2, 3, 1, 2, 1, 2, 1, …
sentiment_by_president <- sentiment_sou %>%
  group_by(president) %>% 
  summarise(avgscore = mean(score)) %>%
  arrange(desc(avgscore)) %>%
  glimpse()
## Observations: 42
## Variables: 2
## $ president <chr> "James Monroe", "Dwight D. Eisenhower", "John Quincy A…
## $ avgscore  <dbl> 0.8635972, 0.8315917, 0.8296199, 0.7834550, 0.7426254,…
ggplot(sentiment_by_president, aes(reorder(president, avgscore), avgscore)) +
  geom_col() +
  coord_flip()

Question 1: Do you have any idea why a particular president is in a particular spot? Why might FDR, for example, be near the bottom?

Question 2: How would you label the x-axis? What’s going to be most clear and who is your intended audience?

Let’s now look at the sentiment of State of the Union speeches over time. Instead of organizing by president, we can group_by() message and date. Plotting that as a scatterplot and fitting a linear regression to the data, we see that the speeches appear to be relatively stable, in terms of sentiment, across time.

sentiment_sou_afinn <- sentiment_sou %>%
  group_by(message, date) %>% # We "group by" message and date instead of by president
  summarise(avgscore = mean(score)) %>%
  glimpse()
## Observations: 232
## Variables: 3
## Groups: message [54]
## $ message  <chr> "7th Annual Message", "8th Annual Message", "Address Be…
## $ date     <date> 1919-12-02, 1920-12-07, 1993-02-17, 1986-02-04, 1987-0…
## $ avgscore <dbl> 0.3684211, 0.5151515, 0.4097363, 0.6190476, 0.6908397, …
ggplot(sentiment_sou_afinn, aes(date, avgscore)) +
  geom_point() + 
  geom_smooth(method="lm")

Q&A

Question: What labels or other insights might you want to communicate along with this chart?

Activity

You have the “labMT” dictionary. Swap it in for “afinn” in the inner_join() function and recreate the scatterplot with labMT-scored speeches.

Question: What difference do you notice when the speeches are scored with the labMT dictionary?

Bonus: Applying sentiment analysis to social media

Now, let’s recreate the sentiment analysis my graduate student Floris Wu and I performed for Roll Call on Senate candidate tweets in the lead-up to the 2018 midterms.

First, let’s pull in the tweets, which were gathered using the rtweet package. Full methodology is published at the bottom of the Roll Call article and on my GitHub account. For the purposes of this activity, I’ve added some extra columns including party affiliation and share of the vote won in the 2018 elections. This information is all in the alltweets.zip file.

Let’s do some exploratory data visualization and create a histogram using ggplot2. We can add custom colors and change the theme with a couple extra lines:

candidate_tweets <- read_csv("alltweets.zip")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   X1 = col_double(),
##   user_id = col_character(),
##   created_at = col_character(),
##   screen_name = col_character(),
##   text = col_character(),
##   source = col_character(),
##   hashtags = col_character(),
##   status_id = col_character(),
##   favorite_count = col_double(),
##   retweet_count = col_double(),
##   status_url = col_character(),
##   name = col_character(),
##   location = col_character(),
##   description = col_character(),
##   followers_count = col_double(),
##   friends_count = col_double(),
##   date = col_date(format = ""),
##   party = col_character(),
##   percent_of_vote = col_double()
## )
glimpse(candidate_tweets)
## Observations: 144,346
## Variables: 19
## $ X1              <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1…
## $ user_id         <chr> "x117501995", "x117501995", "x117501995", "x1175…
## $ created_at      <chr> "11/22/18 17:00", "11/22/18 0:00", "11/21/18 16:…
## $ screen_name     <chr> "SenatorCantwell", "SenatorCantwell", "SenatorCa…
## $ text            <chr> "I want to wish every Washingtonian a happy and …
## $ source          <chr> "TweetDeck", "TweetDeck", "Twitter Web Client", …
## $ hashtags        <chr> NA, NA, "MonumentsForAll", NA, "bipartisan veter…
## $ status_id       <chr> "x1065651119148302336", "x1065394427605082112", …
## $ favorite_count  <dbl> 66, 39, 31, 441, 30, 69, 31, 57, 359, 149, 278, …
## $ retweet_count   <dbl> 11, 9, 10, 114, 9, 20, 7, 12, 146, 33, 95, 62, 9…
## $ status_url      <chr> "https://twitter.com/SenatorCantwell/status/1065…
## $ name            <chr> "Sen. Maria Cantwell", "Sen. Maria Cantwell", "S…
## $ location        <chr> "Washington, DC", "Washington, DC", "Washington,…
## $ description     <chr> "U.S. Senator from Washington State | Tweets fro…
## $ followers_count <dbl> 220787, 220787, 220787, 220787, 220787, 220787, …
## $ friends_count   <dbl> 1619, 1619, 1619, 1619, 1619, 1619, 1619, 1619, …
## $ date            <date> 2018-11-22, 2018-11-22, 2018-11-21, 2018-11-21,…
## $ party           <chr> "D", "D", "D", "D", "D", "D", "D", "D", "D", "D"…
## $ percent_of_vote <dbl> 58.6, 58.6, 58.6, 58.6, 58.6, 58.6, 58.6, 58.6, …
ggplot(candidate_tweets, aes(date, fill=party)) + 
  geom_bar() + # geom_bar() counts rows per date by default; geom_histogram(stat = "count") triggers warnings
  ylim(0, 500) +
  scale_fill_manual(values=c("#404f7c", "forestgreen", "#c63b3b")) +
  theme_minimal() 
## Warning: Removed 16550 rows containing non-finite values (stat_count).
## Warning: Removed 1 rows containing missing values (geom_bar).

Next, we tokenize the tweets, remove stop words, and score each one with the labMT sentiment dictionary.

tokenized_tweets <- candidate_tweets %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  glimpse()
## Joining, by = "word"
## Observations: 2,030,635
## Variables: 19
## $ X1              <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ user_id         <chr> "x117501995", "x117501995", "x117501995", "x1175…
## $ created_at      <chr> "11/22/18 17:00", "11/22/18 17:00", "11/22/18 17…
## $ screen_name     <chr> "SenatorCantwell", "SenatorCantwell", "SenatorCa…
## $ source          <chr> "TweetDeck", "TweetDeck", "TweetDeck", "TweetDec…
## $ hashtags        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ status_id       <chr> "x1065651119148302336", "x1065651119148302336", …
## $ favorite_count  <dbl> 66, 66, 66, 66, 66, 66, 66, 66, 66, 66, 66, 66, …
## $ retweet_count   <dbl> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, …
## $ status_url      <chr> "https://twitter.com/SenatorCantwell/status/1065…
## $ name            <chr> "Sen. Maria Cantwell", "Sen. Maria Cantwell", "S…
## $ location        <chr> "Washington, DC", "Washington, DC", "Washington,…
## $ description     <chr> "U.S. Senator from Washington State | Tweets fro…
## $ followers_count <dbl> 220787, 220787, 220787, 220787, 220787, 220787, …
## $ friends_count   <dbl> 1619, 1619, 1619, 1619, 1619, 1619, 1619, 1619, …
## $ date            <date> 2018-11-22, 2018-11-22, 2018-11-22, 2018-11-22,…
## $ party           <chr> "D", "D", "D", "D", "D", "D", "D", "D", "D", "D"…
## $ percent_of_vote <dbl> 58.6, 58.6, 58.6, 58.6, 58.6, 58.6, 58.6, 58.6, …
## $ word            <chr> "washingtonian", "happy", "safe", "thanksgiving"…
all_sentiment <- tokenized_tweets %>%  
  inner_join(labMT, by = "word") %>%
  group_by(status_id, name, party, followers_count, percent_of_vote) %>%  
  summarise(sentiment = mean(score)) %>% 
  arrange(desc(sentiment))  %>%
  glimpse()
## Warning: Column `word` joining character vector and factor, coercing into
## character vector
## Observations: 141,450
## Variables: 6
## Groups: status_id, name, party, followers_count [141,450]
## $ status_id       <chr> "x1028811546242179073", "x1031029926328258560", …
## $ name            <chr> "Elizabeth Warren", "Karin Housley", "Karin Hous…
## $ party           <chr> "D", "R", "R", "D", "D", "D", "R", "D", "R", "R"…
## $ followers_count <dbl> 2132931, 16645, 16645, 32905, 32905, 32905, 4505…
## $ percent_of_vote <dbl> 62.1, 42.4, 42.4, 31.1, 31.1, 31.1, 48.3, 53.0, …
## $ sentiment       <dbl> 8.42, 8.42, 8.42, 8.42, 8.42, 8.42, 8.42, 8.42, …
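Conceptually, these two steps are simple: unnest_tokens() lowercases each tweet and splits it into one row per word, and a tweet's labMT score is just the mean dictionary value of its matched words (labMT happiness scores run roughly 1 to 9, with 5 neutral). Here is a minimal base-R sketch of that logic, using a made-up tweet and made-up scores, not the real labMT values:

```r
# One toy tweet (invented for illustration)
tweet <- "Happy, safe Thanksgiving to all!"

# Tokenize: lowercase, strip punctuation, split on whitespace
words <- unlist(strsplit(gsub("[[:punct:]]", "", tolower(tweet)), "\\s+"))

# Hypothetical dictionary scores (labMT's actual values differ)
scores <- c(happy = 8.3, safe = 6.7, thanksgiving = 7.1, war = 1.8)

# inner_join() keeps only words found in the dictionary;
# summarise(sentiment = mean(score)) then averages their scores
matched <- scores[words[words %in% names(scores)]]
mean(matched)
# (8.3 + 6.7 + 7.1) / 3, about 7.37
```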

Next, we create a summary ("pivot") table of the average score per candidate and plot it as a scatterplot with custom colors, a custom size range for the points (sized by each candidate's follower count), and a trend line for each party.

Your graphic should look very similar to the final visual that appeared with the article. Notice we can customize the axis labels using ggplot2’s xlab() and ylab().

final_pivot <- all_sentiment %>% 
  group_by(name, party, followers_count, percent_of_vote) %>% 
  summarise(avgscore = mean(sentiment)) %>% 
  glimpse()
## Observations: 76
## Variables: 5
## Groups: name, party, followers_count [76]
## $ name            <chr> "AG Patrick Morrisey", "Amy Klobuchar", "Bernie …
## $ party           <chr> "R", "D", "I", "D", "R", "R", "R", "D", NA, "D",…
## $ followers_count <dbl> 16596, 515718, 7971261, 1011267, 2050, 16928, 44…
## $ percent_of_vote <dbl> 46.3, 60.3, 67.7, 48.3, 38.5, 43.5, 33.5, 59.5, …
## $ avgscore        <dbl> 5.532597, 5.689082, 5.520354, 5.769119, 5.754489…
ggplot(final_pivot, aes(y=percent_of_vote, x=avgscore, color=party)) + 
  geom_point(aes(size=followers_count)) + 
  scale_size(name="", range = c(1.5, 8)) +
  geom_smooth(method="lm", se = FALSE) + 
  labs(title = "As they go low... we go lower?",
       subtitle = "Democrats with more negative tweets won more often in 2018",
       caption = "Source: Twitter") +
  scale_color_manual(values=c("#404f7c", "#34a35c", "#34a35c", "#c63b3b")) +
  xlab("Average sentiment of tweets") +
  ylab("Percent of vote in 2018 midterms") +
  theme_minimal()
## Warning: Removed 7 rows containing non-finite values (stat_smooth).
## Warning: Removed 7 rows containing missing values (geom_point).

Question: What caveats would you include if you were publishing with this kind of social media sentiment analysis? What tweet-level data would you want to inspect and mention that speaks to the shortcomings of these methods?

Visualization and communication

Data analysis and extracted insights are nothing if they aren’t communicated effectively. This section will introduce several visual formats to help improve your data-driven storytelling.

Exporting CSVs

Whenever I reach a stage in my data analysis where I want to save a snapshot of where I am, whether to share with a colleague or to keep for myself, I write my data frame out to a CSV. Note that write.csv() defaults to row.names = TRUE, which prepends a unique row number to every row as an extra, unnamed column; set row.names = FALSE to suppress it.

sou <- read_csv("sou.csv")
## Parsed with column specification:
## cols(
##   link = col_character(),
##   president = col_character(),
##   message = col_character(),
##   date = col_date(format = ""),
##   text = col_character()
## )
## Warning: 20 parsing failures.
## row  col           expected actual      file
## 232 text delimiter or quote      I 'sou.csv'
## 232 text delimiter or quote        'sou.csv'
## 232 text delimiter or quote      D 'sou.csv'
## 232 text delimiter or quote        'sou.csv'
## 232 text delimiter or quote      C 'sou.csv'
## ... .... .................. ...... .........
## See problems(...) for more details.
write.csv(sou, "state-of-the-union-speeches.csv", row.names=FALSE)
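To see the row.names behavior for yourself, here is a quick base-R check using a throwaway data frame and temp files (not the course data):

```r
df <- data.frame(word = c("hope", "change"), n = c(10, 7))

f1 <- tempfile(fileext = ".csv")
f2 <- tempfile(fileext = ".csv")

write.csv(df, f1)                    # default: row names become an unnamed first column
write.csv(df, f2, row.names = FALSE) # no extra column

ncol(read.csv(f1))  # 3 columns: "X" (the row names), word, n
ncol(read.csv(f2))  # 2 columns: word, n
```

Note that readr's write_csv() never writes row names, so the argument isn't needed there.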

Tables

Properly formatted and designed for comprehension, tables can go a long way in communicating data, whether that’s a ranked list or just a random selection of variables. You can create publication-ready tables in RStudio with packages like DT and formattable. Here’s a preview of what they look like.

#install.packages("DT")
#install.packages("formattable")

library(DT)
datatable(Trump_adj)
library(formattable)
formattable(head(Trump_adj, n=15),
            align =c("l","c"))
word n
american 26
left 21
bad 16
fake 16
radical 15
united 15
military 11
anti 10
deal 10
republican 10
disgusting 9
federal 9
illegal 9
real 9
national 8
formattable(head(Trump_adj, n=15), 
            align =c("l","r"),
            list(
  n = color_bar("lightblue")
                )
            )
word n
american 26
left 21
bad 16
fake 16
radical 15
united 15
military 11
anti 10
deal 10
republican 10
disgusting 9
federal 9
illegal 9
real 9
national 8

Wordclouds

Although wordclouds have received their fair share of criticism, they can still be powerful. I like the ggwordcloud package because it allows a lot of customization; see here and here for more documentation.

Let’s take the State of the Union addresses and filter for Barack Obama’s most frequently used words. Then we’ll build two different wordclouds.

#install.packages("ggwordcloud")
library(ggwordcloud)

sou <- read_csv("sou.csv")
## Parsed with column specification:
## cols(
##   link = col_character(),
##   president = col_character(),
##   message = col_character(),
##   date = col_date(format = ""),
##   text = col_character()
## )
## Warning: 20 parsing failures.
## row  col           expected actual      file
## 232 text delimiter or quote      I 'sou.csv'
## 232 text delimiter or quote        'sou.csv'
## 232 text delimiter or quote      D 'sou.csv'
## 232 text delimiter or quote        'sou.csv'
## 232 text delimiter or quote      C 'sou.csv'
## ... .... .................. ...... .........
## See problems(...) for more details.
glimpse(sou)
## Observations: 232
## Variables: 5
## $ link      <chr> "http://www.presidency.ucsb.edu/ws/index.php?pid=29431…
## $ president <chr> "George Washington", "George Washington", "George Wash…
## $ message   <chr> "First Annual Address to Congress", "Second Annual Add…
## $ date      <date> 1790-01-08, 1790-12-08, 1791-10-25, 1792-11-06, 1793-…
## $ text      <chr> "Fellow-Citizens of the Senate and House of Representa…
sou_words_by_president <- sou %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  count(word, president, sort=TRUE)
## Joining, by = "word"
glimpse(sou_words_by_president)
## Observations: 172,429
## Variables: 3
## $ word      <chr> "government", "government", "government", "government"…
## $ president <chr> "Grover Cleveland", "Theodore Roosevelt", "William How…
## $ n         <int> 574, 528, 460, 452, 436, 398, 360, 359, 343, 336, 333,…
Obama_words <- sou_words_by_president %>%
  filter(president == "Barack Obama") %>%
  glimpse()
## Observations: 4,620
## Variables: 3
## $ word      <chr> "america", "people", "american", "jobs", "americans", …
## $ president <chr> "Barack Obama", "Barack Obama", "Barack Obama", "Barac…
## $ n         <int> 204, 196, 179, 179, 146, 130, 115, 114, 111, 103, 100,…
ggplot(head(Obama_words, n=25), aes(label = word, 
                                           size=n)) +
  geom_text_wordcloud() +
  scale_size_area(max_size = 14) +
  theme_minimal()
## Warning in wordcloud_boxes(data_points = points_valid_first, boxes =
## boxes, : Some words could not fit on page. They have been placed at their
## original positions.

ggwordcloud(Obama_words$word, Obama_words$n, 
            min.freq = 10, 
            max.words=50)
## Some words could not fit on page. They have been removed.

Wordtrees

I love word trees, like this one from Jason Davies visualizing Bob Dylan lyrics.

img

img

Using the wordtreer package, which is a wrapper for Google’s Word Tree chart, we can recreate this visualization of Blowin’ in the Wind. Note: I used the datapasta package to paste these lyrics into RStudio with a simple keyboard shortcut that pastes data from the clipboard as “character vector, formatted vertically, one element per line.” Exactly what I needed!

#devtools::install_github("DataStrategist/wordTreeR")
library(wordtreer)

Dylan <- c("How many roads must a man walk down",
  "Before you call him a man?",
  "How many seas must a white dove sail",
  "Before she sleeps in the sand?",
  "How many times must the cannon balls fly",
  "Before they're forever banned?",
  "The answer, my friend, is blowin' in the wind",
  "The answer is blowin' in the wind",
  "How many years can a mountain exist",
  "Before it's washed to the sea?",
  "How many years can some people exist",
  "Before they're allowed to be free?",
  "How many times can a man turn his head",
  "And pretend that he just doesn't see?",
  "The answer, my friend, is blowin' in the wind",
  "The answer is blowin' in the wind",
  "How many times must a man look up",
  "Before he can see the sky?",
  "How many ears must one man have",
  "Before he can hear people cry?",
  "How many deaths will it take till he knows",
  "That too many people have died?",
  "The answer, my friend, is blowin' in the wind",
  "The answer is blowin' in the wind")

wordtree(text=Dylan,
         targetWord = "How",
         direction="suffix",
         Number_words = 20,
         fileName="dylan.html")
browseURL("dylan.html")

To create a similar data frame of sentences containing a specified keyword, I’ve found the following code useful:

sou_sentences <- sou %>%
  filter(president == "Barack Obama") %>%
  unnest_tokens(sentences, text, token="sentences", to_lower = FALSE) %>%
  filter(str_detect(sentences, "terror")) %>%
  select(sentences)
glimpse(sou_sentences)
## Observations: 36
## Variables: 1
## $ sentences <chr> "We're negotiating an agreement with the Afghan Govern…
sou_all_sentences <- unlist(sou_sentences)
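The same keyword filter can be expressed in base R with grepl(); a self-contained sketch on invented sentences:

```r
sentences <- c("We will defeat terror.",
               "The economy is growing.",
               "Terrorism demands vigilance.")

# Case-insensitive match on the substring "terror"
hits <- sentences[grepl("terror", sentences, ignore.case = TRUE)]
hits
# -> the first and third sentences
```

One caveat: str_detect(sentences, "terror") is case-sensitive by default, so it would miss "Terrorism" at the start of a sentence; str_detect(sentences, regex("terror", ignore_case = TRUE)) catches both.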

Scatterplots and more

In the preceding sections we've built scatterplots, line charts and bar charts, all with the ggplot2 package. Let's refresh our memory about the code needed to build a bar chart, a scatterplot where text replaces points, and a more traditional scatterplot.

Remember we can customize the axis labels, theme, titles, captions and subtitles in ggplot2. For tips on changing the color scheme, read through this documentation and know that a little Googling goes a long way for troubleshooting your visualizations in R.

Tip: hrbrthemes is a great package with several very nice built-in themes. Other packages exist to mimic FiveThirtyEight and The Economist. I'm a big fan of the BBC's R cookbook, which walks users through creating BBC-style graphics. Having trouble with overlapping labels? Try the ggrepel package.

# A bar chart

sou <- read_csv("sou.csv")
## Parsed with column specification:
## cols(
##   link = col_character(),
##   president = col_character(),
##   message = col_character(),
##   date = col_date(format = ""),
##   text = col_character()
## )
## Warning: 20 parsing failures.
## row  col           expected actual      file
## 232 text delimiter or quote      I 'sou.csv'
## 232 text delimiter or quote        'sou.csv'
## 232 text delimiter or quote      D 'sou.csv'
## 232 text delimiter or quote        'sou.csv'
## 232 text delimiter or quote      C 'sou.csv'
## ... .... .................. ...... .........
## See problems(...) for more details.
sent_by_president <- sou %>%
  unnest_tokens(word, text) %>%
  inner_join(afinn, by = "word") %>%
  group_by(president) %>% 
  summarise(avgscore = mean(score)) %>%
  arrange(desc(avgscore))

ggplot(sent_by_president, aes(reorder(president, avgscore), avgscore)) +
  geom_col() +
  coord_flip()

# A scatterplot where text replaces points 

Trump <- read_csv("Trump_tweets.csv")
## Parsed with column specification:
## cols(
##   created_at = col_datetime(format = ""),
##   screen_name = col_character(),
##   text = col_character(),
##   source = col_character(),
##   retweet_count = col_double(),
##   favorite_count = col_double()
## )
Trump_adj_sent <- Trump %>%
  unnest_tokens(word, text) %>% # tokenize the headlines
  anti_join(stop_words) %>%
  inner_join(parts_of_speech) %>% # join parts of speech dictionary
  group_by(word) %>% 
  filter(pos == "Adjective") %>% 
  count(word, sort = TRUE) %>%
  inner_join(labMT, by="word") %>%  # add in sentiment dictionary
  glimpse()
## Joining, by = "word"
## Joining, by = "word"
## Warning: Column `word` joining character vector and factor, coercing into
## character vector
## Observations: 282
## Variables: 3
## Groups: word [282]
## $ word  <chr> "american", "left", "bad", "fake", "radical", "united", "m…
## $ n     <int> 26, 21, 16, 16, 15, 15, 11, 10, 10, 10, 9, 9, 9, 9, 8, 8, …
## $ score <dbl> 6.74, 4.64, 2.64, 2.90, 4.58, 7.32, 4.78, 3.65, 6.32, 4.42…
ggplot(Trump_adj_sent, aes(n, score, color = score>5)) +
  geom_text(aes(label=word), check_overlap = TRUE) +
  theme_minimal() +
  scale_color_manual(values=c("#c63b3b", "#404f7c")) +
  theme(legend.position = "none")

# A scatterplot with regression lines and title, subtitle and caption

ggplot(final_pivot, aes(y=percent_of_vote, x=avgscore, color=party)) + 
  geom_point(aes(size=followers_count)) + 
  scale_size(name="", range = c(1.5, 8)) +
  geom_smooth(method="lm", se = FALSE) + 
  labs(title = "As they go low... we go lower?",
       subtitle = "Democrats with more negative tweets won more often in 2018",
       caption = "Source: Twitter") +
  scale_color_manual(values=c("#404f7c", "#34a35c", "#34a35c", "#c63b3b")) +
  xlab("Average sentiment of tweets") +
  ylab("Percent of vote in 2018 midterms") +
  theme_minimal()
## Warning: Removed 7 rows containing non-finite values (stat_smooth).
## Warning: Removed 7 rows containing missing values (geom_point).

Activity

For one final activity in text mining, let's explore a dataset of 130,000 wine reviews from WineEnthusiast.com, which I downloaded from Kaggle. Below, I've ingested the data, filtered for French wine reviews only, kept some of the more interesting variables (points, price, province and variety), tokenized the reviews, and removed stop words.

You now have 444,000 words organized by price, points, variety and province in France. Your mission is to use one of the methods we’ve learned in this course to continue analyzing the text of these wine reviews. Take 20 minutes and return to the group chat with an insight from this dataset. If you wish to share one to three bullet points and a visualization through a Google doc, you can access that here.

Note: You can access the document either with your Google Account or anonymously (likely through an Incognito or Private type of browser session). A Google Account is not required. If you choose to authenticate with your Google Account, your username and/or other identifying information could be visible to others.

reviews <- read_csv("wine-reviews.zip")
## Multiple files in zip: reading 'winemag-data-130k-v2.csv'
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   X1 = col_double(),
##   country = col_character(),
##   description = col_character(),
##   designation = col_character(),
##   points = col_double(),
##   price = col_double(),
##   province = col_character(),
##   region_1 = col_character(),
##   region_2 = col_character(),
##   taster_name = col_character(),
##   taster_twitter_handle = col_character(),
##   title = col_character(),
##   variety = col_character(),
##   winery = col_character()
## )
glimpse(reviews)
## Observations: 129,971
## Variables: 14
## $ X1                    <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, …
## $ country               <chr> "Italy", "Portugal", "US", "US", "US", "Sp…
## $ description           <chr> "Aromas include tropical fruit, broom, bri…
## $ designation           <chr> "Vulkà Bianco", "Avidagos", NA, "Reserve L…
## $ points                <dbl> 87, 87, 87, 87, 87, 87, 87, 87, 87, 87, 87…
## $ price                 <dbl> NA, 15, 14, 13, 65, 15, 16, 24, 12, 27, 19…
## $ province              <chr> "Sicily & Sardinia", "Douro", "Oregon", "M…
## $ region_1              <chr> "Etna", NA, "Willamette Valley", "Lake Mic…
## $ region_2              <chr> NA, NA, "Willamette Valley", NA, "Willamet…
## $ taster_name           <chr> "Kerin O’Keefe", "Roger Voss", "Paul Gregu…
## $ taster_twitter_handle <chr> "@kerinokeefe", "@vossroger", "@paulgwine …
## $ title                 <chr> "Nicosia 2013 Vulkà Bianco  (Etna)", "Quin…
## $ variety               <chr> "White Blend", "Portuguese Red", "Pinot Gr…
## $ winery                <chr> "Nicosia", "Quinta dos Avidagos", "Rainsto…
french_reviews <- reviews %>%
  filter(country == "France") %>% # Filter for only wines from "France"
  glimpse()
## Observations: 22,093
## Variables: 14
## $ X1                    <dbl> 7, 9, 11, 30, 42, 49, 53, 63, 65, 66, 69, …
## $ country               <chr> "France", "France", "France", "France", "F…
## $ description           <chr> "This dry and restrained wine offers spice…
## $ designation           <chr> NA, "Les Natures", NA, "Nouveau", "Nouveau…
## $ points                <dbl> 87, 87, 87, 86, 86, 86, 85, 86, 86, 86, 86…
## $ price                 <dbl> 24, 27, 30, NA, 9, 14, 15, 58, 24, 15, 55,…
## $ province              <chr> "Alsace", "Alsace", "Alsace", "Beaujolais"…
## $ region_1              <chr> "Alsace", "Alsace", "Alsace", "Beaujolais-…
## $ region_2              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ taster_name           <chr> "Roger Voss", "Roger Voss", "Roger Voss", …
## $ taster_twitter_handle <chr> "@vossroger", "@vossroger", "@vossroger", …
## $ title                 <chr> "Trimbach 2012 Gewurztraminer (Alsace)", "…
## $ variety               <chr> "Gewürztraminer", "Pinot Gris", "Gewürztra…
## $ winery                <chr> "Trimbach", "Jean-Baptiste Adam", "Leon Be…
french_reviews_tokenized <- french_reviews %>%
  select(description, points, price, province, variety) %>% # keep the columns we want
  unnest_tokens(word, description) %>% # tokenize the description column
  anti_join(stop_words)  # remove stopwords 
## Joining, by = "word"
glimpse(french_reviews_tokenized) 
## Observations: 444,013
## Variables: 5
## $ points   <dbl> 87, 87, 87, 87, 87, 87, 87, 87, 87, 87, 87, 87, 87, 87,…
## $ price    <dbl> 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 27, 27, 27,…
## $ province <chr> "Alsace", "Alsace", "Alsace", "Alsace", "Alsace", "Alsa…
## $ variety  <chr> "Gewürztraminer", "Gewürztraminer", "Gewürztraminer", "…
## $ word     <chr> "dry", "restrained", "wine", "offers", "spice", "profus…
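As one possible starting point for the activity, a common pattern is counting word frequency within a group. Here is a base-R sketch on a toy data frame shaped like french_reviews_tokenized (the data are invented):

```r
toy <- data.frame(
  province = c("Alsace", "Alsace", "Bordeaux", "Bordeaux", "Bordeaux"),
  word     = c("spice", "dry", "tannins", "tannins", "dark")
)

# Base-R equivalent of count(province, word, sort = TRUE)
counts <- aggregate(list(n = rep(1, nrow(toy))),
                    by = toy[c("province", "word")],
                    FUN = sum)
counts <- counts[order(-counts$n), ]
head(counts, 3)  # "tannins" in Bordeaux tops the list with n = 2
```

In the tidyverse, french_reviews_tokenized %>% count(province, word, sort = TRUE) does the same thing in one line.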